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Abstract 

In this paper, we report on very efficient algorithms for the spherical harmonic transform (SHT) that 
can be used in numerical simulations of partial differential equations. Explicitly vectorized variations of 
the Gauss-Legendre algorithm are discussed and implemented in the open-source library SHTns which in- 
cludes scalar and vector transforms. This library is especially suitable for direct numerical simulations of 
non-linear partial differential equations in spherical geometry, like the Navier-Stokes equation. The per- 
formance of our algorithms is compared to third party SHT implementations, including fast algorithms. 
Even though the complexity of the algorithms implemented in SHTns are of order 0{N'^) (where N is the 
maximum harmonic degree of the transform), they perform much better than the available implementa- 
tions of asymptotically fast algorithms, even for a truncation as high as = 1023. In our performance 
tests, the best performance for SHT on the x86 platform is delivered by SHTns, which is available at 
https : / /bitbucket . org/nschaef f/shtns| as open source software. 
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1. Introduction 

Spherical harmonics are the eigenfunctions of the Laplace operator on a sphere. They form a basis and 
are very useful and convenient to describe data on a sphere in a consistent way in spectral space. Spherical 
Harmonic Transforms (SHT) are the spherical counterpart of the Fourier transform, casting spatial data to 
the spectral domain and vice versa. They are commonly used in various pseudo-spectral direct numerical 
simulations in spherical geometry, for simulating the Sun or the liquid core of the Earth among others 

All numerical simulations that take advantage of spherical harmonics use the classical Gauss-Legendre 
algorithm (see section [2]) with complexity 0{N^) for a truncation at spherical harmonic degree N. As a 
consequence of this high computational cost when N increases, high resolution spherical codes currently 
spend most of their time performing SHT. Two years ago, state of the art numerical simulations used 
A^ = 255 [l^. 

However, there exist several asymptotically fast algorithms 0; S S @, [3 1 but the overhead for these 
fast algorithm is such that they do not claim to be effectively faster for TV < 512. In addition, some of them 
lack stability and flexibility. 

Among the asymptotically fast algorithm, only two have open-source implementations, and the only one 
which seems to perform reasonably well is SpharmonicKit, based on the algorithms described by Healy et al. 
0. Its main drawback is the need of a latitudinal grid of size 2(iV-|- 1) while the Gauss-Legendre quadrature 
allows the use of only N + 1 collocation points. Thus, even if it were as fast as the Gauss-Legendre approach 
for the same truncation N , the overall numerical simulation would be slower because it would operate on 
twice as many points. These facts explain why the Gauss-Legendre algorithm is still the most efhcient 
solution for numerical simulations. 

A recent paper Q reports that carefully tuned software could finally run 9 times faster on CPU than the 
initial non-optimized version, and insists on the importance of vectorization and careful optimization of the 
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code. As the very goal of this work is to speed-up numerical simulations, we have written a highly optimized 

and explicitly vectorized version of the Gauss-Legendre SHT algorithm. The next section recalls the basics of 
spherical harmonic transforms. We then describe the optimizations we use and we compare the performance 
of our transform to several other SHT implementations. We conclude this paper by a short summary, a 
quick description of other features of the SHTns library, and perspectives for future developments. 



2. Spherical Harmonic Treinsform (SHT) 

2.1. Definitions and properties 

The orthonormalized spherical harmonics of degree n and order —n<m<n are functions defined on 
the sphere as: 

^7(0, 0) = P:\cose) exp(zm0) (1) 

where 6 is the colatitude, (f) is the longitude and are the associated Legendre polynomials normalized 
for spherical harmonics 



which involve derivatives of Legendre Polynomials Pn{x) defined by the following recurrence: 

Po{x) = 1 
Pi(x) =x 

nPnix) = {2n - 1) xPn-i{x) - (n - l)Pn-2{x) 
The spherical harmonics Y^{6, 4)) form an orthonormal basis for functions defined on the sphere: 

r r yn{0A)yh0,<j))s\neded<j) = 5r,i5mk (3) 
Jo Jo 

with 6ij the Kronecker symbol. By construction, they are eigenfunctions of the Laplace operator on the unit 
sphere: 

Ay„" = -n(n+l)y„" (4) 

This property is very appealing for solving many physical problems in spherical geometry involving the 
Laplace operator. 

2.2. Synthesis or inverse transform 

The Spherical Harmonic synthesis is the evaluation of the sum 

N n 

m<^) = E E (5) 

n=0 m=—n 

up to degree n = N, given the complex coefficients /™. If f{0,^) is a real-valued function, = (/™)*, 
where z* stands for the complex conjugate of z. 

The sums can be exchanged, and using the expression of we can write 

N I N \ 

m^)= E J2 fnPn{^OSe)\e'"^t' (6) 

m=-N \n=\m\ J 

From this last expression, it is obvious that the summation over m is a regular Fourier Transform. Hence 
the remaining task is to evaluate 

N 

fm{0)= E fnPnicOSe) (7) 
n=\m\ 

or its discrete version at given collocation points 9j . 

2 



2.3. Analysis or forward transform 

The analysis step of the SHT consists in computing the coefficients 



r r f{0,^)Y,T{e, cj,) sin Odd (8) 



The integral over (j) is easily obtained using the Fourier Transform: 

Ln{0)= r f{e,4>)e'"-U<t^ (9) 

Jo 

so the remaining Legendre transform reads 

fn = r /rn(e)/r(cOS 9) siu 6 dO (f 0) 



The discrete problem reduces to the appropriate quadrature rule to evaluate the integral PH)) knowing only 
the values fm{Qj)- In particular, the use of the Gauss-Legendre quadrature replaces the integral of expression 
[TO] by the sum 

Ng 

/r = E/™(^j)^n"(^o«^^H (11) 



where 9j and Wj are respectively the Gauss nodes and weights [13|. Note that the sum equals the integral 
if fm{9)P™ {cos 6) is a polynomial in cos 9 of order 2Ne — 1 or less. If fm{9) is given by expression [71 then 
fm{9)P™ {cos 9) is always a polynomial in cos0, of degree at most 2N . Hence the Gauss-Legendre quadrature 
is exact for Ng > iV + 1. 

A discrete spherical harmonic transform using Gauss nodes as latitudinal grid points and a Gauss- 
Legendre quadrature for the analysis step is refered to as a Gauss-Legendre algorithm. 



3. Optimization of the Gauss-Legendre algorithm 

3.1. Standard optimizations 

Let us first recall some standard optimizations found in almost every serious implementation of the 
Gauss-Legendre algorithm. All the following optimizations are implemented in the SHTns library. 

Use the Fast-Fourier Transform. The expressions of section [2] show that part of the SHT is in fact a Fourier 
transform. The fast Fourier transform should be used for this part, as it improves accuracy and speed. 
SHTns uses the FFTW libraryQ for the fast Fourier transform, a portable, flexible and blazingly fast FFT 
implementation. 

Take advantage of Hermitian symmetry for real data. When dealing with real- valued data, the spectral 
coefficients verify f~"^ — (/™)*, so we only need to store them for m > 0. This also allows the use of faster 
real-valued FFTs. SHTns only supports real- valued spatial fields. 

Take advantage of mirror symmetry. Due to the defined symmetry of spherical harmonics with respect to 
a reflexion about the equator 

p::{^'9) = {-ir+^^p::^{9) 

one can reduce by a factor 2 the operation count of both forward and inverse transforms. 

Precompute values of P^^ . The coefficients P^{cos9j) appear in both synthesis and analysis expressions 
(|71 and [TO]) . and can be precomputed and stored for all (n,m,j). When performing multiple transforms, it 
avoids computing the Legendre polynomial recursion at every transform and saves some computing power, 
at the expense of memory bandwidth. This may or may not be efficient, as we will discuss later. 
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Figure 1 : Two associated Legendre polynomials of degree n ■ 
the equator. 



: 40 and order m = 33 and m = 36, showing the localization near 



Polar optimization. High order spherical harmonics have their magnitude decrease exponentially when ap- 
proaching the poles as shown in figure [1] Hence, the integral of expression [TO] can be reduced to 



J 71 



(12) 



where 



is considered to be zero. Similarly, the synthesis of fm{0) 
(eq. [7]) is only needed for 0™" < 6 < tt — 0™". SHTns uses a threshold 6*™" that does not depend on n, 
which leads to around 5% to 15% speed increase, depending on the desired accuracy and the truncation N . 



> is a threshold below which P, 



Cache optimizations. The precomputed coefficients P^{cos9j) can be reordered and systematic zeros (due 
to polar optimization) can be stripped out, which leads to sensible speed increase. 

3.2. On-the-fly algorithms and vectorization on the x86 platform 
It can be shown that P™(a;) can be computed recursively by 

2\l™l/2 



KPn-2{^) 



with 



1 tist 



47r 



k=l 



2k 



An^ - 1 



2n + 1 [n- If 



2n - 3 



(13) 
(14) 
(15) 



(16) 



The coefficients a™ and 6™ do not depend on x, and can be easily precomputed and stored in an array of 
N{N + 1) values. This has to be compared to the order values of P™(xj), which are usually precomputed 
and stored in the spherical harmonic transforms implemented in numerical simulations. When N becomes 
very large, it is no longer possible to store P^{xj) in memory (for N > 1023 nowadays) and on-the-fly 
algorithms (where the values of P™ {xj ) are always computed from the recurrence relation when needed) are 
then the only possibility. 

Here, we would like to stress that even far from that storage limit, on-the-fly algorithm can be signiflcantly 
faster thanks to vector capabilities of modern x86 processors. Most desktop and laptop computers, as well as 
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on-the-fly 4-vector (AVX) 



on-the-fly 2-vector (SSE3) 



-♦ 



precomputed matrices 
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Figure 2: Efficiency (A^ + l)3/(2t/) of various algorithms, where t is the execution time and / the frequency of the Xeon E5- 
2680 CPU (2.7GHz). On-the-fly algorithms with two different vector sizes are compared with the algorithm using precomputed 
matrices. Note the influence of hardware vector size for on-the-fly algorithms (AVX vector pack 4 double precision floating 
point numbers where SSE3 vectors pack only 2). The efficiency of the algorithm based on precomputed matrices drops above 
AT = 127 probably due to cache size limitations. 

many high performance computing chisters, have support for Single-Instruction-Multiple-Data (SIMD) op- 
erations in double precision. The SSE2 instruction set is available since year 2000, and is currently supported 
by almost every PC, allowing to perform the same double precision arithmetic operations on a vector of 2 
double precision numbers, effectively doubling the computing power. The recently introduced AVX instruc- 
tion set increases the vector size to 4 double precision numbers. This means that the computation of 
by the recursion relation [13] (which requires 3 multiplications and 1 addition) will become more and more 
efficient, as P^ix) can be computed for 2 or 4 values of x simultaneously. Practically speaking, computing 
P™(a;) on-the-fly may be faster than loading pre-computed values from memory. Hence, as already pointed 
out by Dickson et al. Q , it is therefore very important to use the vector capabilities of modern processors to 
address their full computing power. Furthermore, when running multiple transforms on the different cores 
of a computer, the performance of on-the-fly transforms (which use less memory bandwidth) scales much 
better than algorithms with precomputed matrices, because the memory bandwidth is shared between cores. 
Superscalar architectures that do not have double-precision SIMD instructions but have many computation 
unit per core (like the POWER? or SPARC64) could also benefit from on-the-fly transforms by saturating 
the many computation units with independent computations (at different x). 

Figure [2] shows the benefit of explicit vectorization of on-the-fiy algorithms on an Intel Xeon E5-2680 
(Sandy Bridge architecture with AVX instruction set running at 2.7GHz) and compares on-the-fly algorithms 
with algorithms based on precomputed matrices. With the 4- vectors of AVX, the fastest algorithm is always 
on-the-fly, while for 2-vectors, the fastest algorithm uses precomputed matrices for N < 200. The efficiency 
drop for N = 2047 is due to the additional rescaling required to properly compute the recurrence relation 
at such high N with double-precision numbers. 

Runtime tuning. We have now two different available algorithm: one uses precomputed values for P^{x) 
and the other one computes them on-the-fly at each transform. The SHTns library compares the time taken 
by those algorithms (and variants) at startup and chooses the fastest, similarly to what the FFTW library^ 
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N 


libpsht (1 thread) 


libpsht (12 threads) 


DH (fast) 


SHTns 


63 


1.05 ms 


5.0 ms 


1.1 ms 


0.09 ms 


127 


4.7 ms 


5.4 ms 


5.5 ms 


0.60 ms 


255 


27 ms 


8.5 ms 


21 ms 


4.2 ms 


511 


162 ms 


23.5 ms 


110 ms 


28 ms 


1023 


850 ms 


125 ms 


600 ms 


216 ms 


2047 


4.4 s 


0.7 s 


NA 


1.9 s 


4095 


30.5 s 


3.0 s 


NA 


NA 



Table 1: Comparison of execution time for different SHT implementations. The numbers correspond to the average execution 
time for forward and backward scalar transform (including the FFT) on an Intel Xeon X5650 (2.67GHz) with 12 cores. The 
programs were compiled with gcc 4.4.5 and -03 -inarch=native -f fast-math compilation options. 

does. In the forthcoming years, AVX wih become widely available, and the benefits of on-the-fly vectorized 
transforms will become even more important. 

3.3. Performance comparisons 

Table [T] reports the timing measurements of two SHT libraries, compared to the optimized Gauss- 
Legendre implementation found in the SHTns library (this work). We compare with the Gauss-Legendre 
implementation of libpsht a parallel spherical harmonic transform library targeting very large N, 
and with SpharmonicKit 2.7 (DH) which implements one of the Driscoll-Healy fast algorithms Q. All 
the timings are for a complete SHT, which includes the Fast Fourier Transform. Note that the Gauss- 
Legendre algorithm is by far (a factor of order 2) the fastest algorithm of the libpsht library. Note also 
that SpharmonicKit is limited to + 1 being a power of two, requires 2(A^-|- 1) latitudinal colocation points, 
and crashed for N = 2047. The software library implementing the fast Legendre transform described in 

libftsh, has also been tested, and found to be of comparable peformance to that of SpharmonicKit, 
although the comparison is not straightforward because libftsh did not include the Fourier Transform. 
Again, that fast library could not operate at = 2047 because of memory limitations. Note finally that 
these measurements were performed on a machine that did not support the new AVX instruction set. 

In order to ease the comparison, we define the efRciency of the SHT by {N + 1)'^ / {2tfp), where t is the 
execution time (reported in table [1]), p the number of parallel threads (which is 12 for the parallel version of 
libpsht, and 1 otherwise) and / the frequency of the CPU (2.67GHz in this benchmark). This definition 
of efRciency takes into account the number of threads p, which is relevant for numerical simulations, where 
many SHT could be performed in parallel. For reference, we added the efficiency of the parallel algorithm 
with p set to 1, showing that even with 12 threads, libpsht performs better than SHTns only for N > 500. 
Note that {N + 1)^/2 reflects the number of computation elements of a Gauss-Legendre algorithm (the 
number of modes (iV -I- 1)(A^ + 2)/2 times the number of latitudinal points -I- 1). An efRciency that does 
not depend on N corresponds to an algorithm with an execution time proportional to N^. 

The efRciency of the tested algorithms are displayed in figure [31 Not surprisingly, the Driscoll-Healy 
implementation has the largest slope, which means that its efRciency grows fastest with N, as expected for a 
fast algorithm. It also performs slightly better than libpsht for N > 511. However, even for A'" = 1023 (the 
largest size that it can compute), it is still 2.8 times slower than the Gauss-Legendre algorithm implemented 
in SHTns. It is remarkable that SHTns achieves an efRciency very close to 1, meaning that almost one element 
per clock cycle is computed for A^ = 511 and A^ = 1023. Overall, SHTns is between two and ten times faster 
than the best alternative. We expect this gap to grow with the 4-vectors of the new AVX machines. 

3.4. Accuracy 

One cannot write about an SHT implementation without addressing its accuracy. Here, the Gauss- 
Legendre quadrature ensures very good accuracy, at least on par with other high quality implementations. 
To quantify the error we start with random spherical harmonic coefficients Q™ with each real part and 
imaginary part between —1 and +1. After a backward and forward transform (with orthonormal spherical 
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Figure 3: Efficiency {N + l)^/(2i/p) of the implementations from table ^ where t is the execution time, p the number of 
parallel threads and / the frequency of the Xeon X5650 CPU (2.67GHz) with 12 cores. 
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harmonics), we compare the resulting coefficients i?™ with the originals Q™. We use two different error 
measurements: the maximum error is defined as 

e^a. =max|i?r-Qri 

l,m 

while the root mean square (rms) error is defined as 

The error measurements for our on-the-fly Gauss-Legendre implementation with the default polar optimiza- 
tion and for various truncation degree N are shown on figure E) 

4. Conclusion and perspectives 

Despite the many fast spherical transform algorithms published, the few with a publicly available imple- 
mentation are far from the performance of a carefully written Gauss-Legendre algorithm, as implemented 
in the SHTns library, even for quite large truncation {N — 1023). Explicitly vectorized on-the-fly algorithm 
seem to be able to unleash the computing power of nowadays and future computers, without suffering too 
much of memory bandwidth limitations. Such kind of algorithm, that uses more computing resources but 
less memory bandwidth, should also be efficient on GPU machines, and future work will involve an OpenCL 
implementation. Finally, by choosing at runtime the fastest available SHT algorithm, the SHTns library 
will most certainly deliver the fastest spherical transform to your platform. The versatile truncation, the 
various normalization conventions supported, as well as the scalar and vector transform routines available 
for C/C-I--I-, Fortran or Python, should suit most of the current and future needs in high performance 
computing. 
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